Explore and summarize Red Wine data by Sai Raj Reddy

Dataset Introduction

This is a clean, tidied dataset based on the research of Cortez et al. (2009), used here to explore the quality of red wine. It covers the red variant of the Portuguese “Vinho Verde” wine. In the dataset, several physicochemical features serve as inputs, and the quality of the wine, the output, is a score based on sensory data.

str(red_wine_data)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Univariate Plots Section

Univariate Analysis

What is the structure of your dataset?

  • The dataset contains 1599 rows and 13 columns.
  • Column X is a row index. The remaining columns are 11 physicochemical input variables and 1 output variable (quality). All the variables are numeric.
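The structure can be checked programmatically. The following is a minimal sketch rather than the code behind this report: `wine_head` is a hypothetical name for a small data frame holding the first ten rows of four columns, copied from the str() output above. The same calls should apply to the full `red_wine_data`.

```r
# First ten rows of four columns, copied from the str() output above.
# wine_head is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Does X only repeat the row number? If so, it is an index, not a feature.
x_is_row_index <- identical(wine_head$X, seq_len(nrow(wine_head)))

# Every column should be numeric (integer columns count as numeric in R).
all_numeric <- all(vapply(wine_head, is.numeric, logical(1)))

# Inputs are whatever remains after dropping the index and the output.
input_vars <- setdiff(names(wine_head), c("X", "quality"))

print(x_is_row_index)
print(all_numeric)
print(input_vars)
```

On the full data frame, `length(input_vars)` gives the number of physicochemical inputs.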

What is/are the main feature(s) of interest in your dataset?

  • The output variable ‘quality’, which is based on sensory data.
  • The most interesting features to explore are pH, alcohol and acidity.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • Sulphates and density seem interesting and are good candidates for further analysis.

Did you create any new variables from existing variables in the dataset?

  • Yes, I created a new variable for the total amount of acidity.
  • This new variable shows that the volatile acidity in the wines is consistently small.
  • The variation of total acidity is similar to the variation of fixed acidity, since the volatile acidity is very small in all the wines.
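The report does not show how the total acidity variable was built. The sketch below assumes it is the sum of fixed and volatile acidity; that definition, the column name `total.acidity`, and the data frame name `wine_head` are assumptions for illustration. The values are the first ten rows from the str() output above.

```r
# First ten rows of the two acidity columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Assumption: total acidity = fixed acidity + volatile acidity.
wine_head$total.acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity

# Fraction of the total contributed by volatile acidity, per wine.
volatile_share <- wine_head$volatile.acidity / wine_head$total.acidity

# If the volatile part is small, total and fixed acidity should move together.
acidity_cor <- cor(wine_head$fixed.acidity, wine_head$total.acidity)

print(range(volatile_share))
print(acidity_cor)
```

A small `volatile_share` together with a correlation close to 1 is what the observation above would predict. Ten rows are only a spot check, so the full `red_wine_data` should be used for any real conclusion.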

Bivariate Plots Section

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • The higher the quality of the wine, the higher its alcohol content tends to be.
  • The box plots show the average alcohol level rising as the quality increases.
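The numbers behind such box plots can be computed directly. This is a sketch using base R rather than the plotting code of this report; `wine_head` is a hypothetical data frame with the first ten rows from the str() output above. Ten rows are far too few to show the trend, so the example only demonstrates the calls.

```r
# First ten rows of alcohol and quality, copied from the str() output.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol content within each quality score.
median_by_quality <- tapply(wine_head$alcohol, wine_head$quality, median)

# plot = FALSE returns the box plot statistics without drawing anything;
# row 3 of $stats holds the median of each group.
bp <- boxplot(alcohol ~ quality, data = wine_head, plot = FALSE)

print(median_by_quality)
print(bp$names)
print(bp$stats[3, ])
```

Dropping `plot = FALSE` draws the box plots themselves. On the full `red_wine_data`, the medians can then be compared across quality scores.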

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, there is a relationship between pH and density: they are inversely related.

What was the strongest relationship you found?

  • The relationship between pH and fixed acidity.
  • They are inversely related: the higher the fixed acidity, the lower the pH.
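A correlation coefficient is one way to check the sign and strength of the relationship between pH and fixed acidity. The sketch below uses only the first ten rows from the str() output above, stored in a hypothetical data frame `wine_head`. It illustrates the call and does not reproduce the result for the full dataset.

```r
# First ten rows of fixed acidity and pH, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation: a negative value means pH falls as fixed acidity rises.
ph_acidity_cor <- cor(wine_head$fixed.acidity, wine_head$pH)

print(ph_acidity_cor)
```

To find the strongest relationship across all pairs, `cor(red_wine_data)` returns the full correlation matrix, assuming every column is numeric as the str() output indicates.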

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • Yes, we can observe that in the second plot.
  • The higher the alcohol content, the lower the density of the wine.
  • Using total acidity as the color scale on the alcohol versus density plot, we see that total acidity is higher in the wines with high density.
  • Total acidity is also spread across wines at various levels of alcohol content.

Were there any interesting or surprising interactions between features?

  • Yes, the first plot is interesting: it shows that the lower-quality wine categories have steeper slopes.
  • The low-rated wines are found mostly in the high-density, low-alcohol area, i.e. top-left.
  • The high-rated wines are concentrated in the low-density, high-alcohol area, i.e. bottom-right.
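The slope of each quality category can be estimated by fitting a separate linear model of density on alcohol per quality score. The sketch below is an illustration, not the code behind the plot. `wine_head` is a hypothetical data frame holding the first five rows from the str() output above, where density appears only as rounded to three decimals. Groups with fewer than two wines cannot give a slope and are returned as NA.

```r
# First five rows, copied from the str() output (density is rounded there).
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  density = c(0.998, 0.997, 0.997, 0.998, 0.998),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Slope of the regression line density ~ alcohol for one group of wines.
slope_for <- function(d) {
  if (nrow(d) < 2) return(NA_real_)  # a single point has no slope
  coef(lm(density ~ alcohol, data = d))[["alcohol"]]
}

# One slope per quality score; names of the result are the quality scores.
slopes <- vapply(split(wine_head, wine_head$quality), slope_for, numeric(1))

print(slopes)
```

On the full `red_wine_data`, comparing the magnitude of these slopes across quality scores is one way to check whether the lower-quality categories really have the steeper lines.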

Final Plots and Summary

Plot One

Description One

  • The output variable of this dataset is ‘quality’.
  • This gives us the overall picture of how the inputs can affect the quality.
  • Therefore, quality can be considered the basis for any bivariate or multivariate analysis from this point on.

Plot Two

Description Two

  • This shows the relationship between pH and fixed acidity.
  • The variables pH and fixed acidity have an inverse relationship: the higher the pH, the lower the fixed acidity, and vice versa.

Plot Three

Description Three

  • The low-rated wines are found mostly in the high-density, low-alcohol area, i.e. top-left.
  • The high-rated wines are concentrated in the low-density, high-alcohol area, i.e. bottom-right.
  • Considering the regression lines for each quality level, we find that the lines for the lower-quality categories lie towards the left and have steeper slopes.

Reflection